Determining the number of clusters in a data set, a quantity often labelled ''k'' as in the ''k''-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.

For a certain class of clustering algorithms (in particular ''k''-means, ''k''-medoids and the expectation–maximization algorithm), there is a parameter commonly referred to as ''k'' that specifies the number of clusters to detect. Other algorithms, such as DBSCAN and OPTICS, do not require the specification of this parameter; hierarchical clustering avoids the problem altogether.

The correct choice of ''k'' is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. In addition, increasing ''k'' without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when ''k'' equals the number of data points, ''n''). Intuitively then, ''the optimal choice of ''k'' will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster''. If an appropriate value of ''k'' is not apparent from prior knowledge of the properties of the data set, it must be chosen somehow. There are several categories of methods for making this decision.

== Rule of thumb ==
One simple rule of thumb sets the number to
:<math>k \approx \sqrt{n/2}</math>
with ''n'' as the number of objects (data points).
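A minimal sketch of both points above, assuming Python with NumPy and scikit-learn and a synthetic data set (none of which are specified in the article): the within-cluster error keeps shrinking as ''k'' grows, approaching zero as ''k'' approaches ''n'', and the rule of thumb for ''n'' = 200 points gives <math>k \approx \sqrt{200/2} = 10</math>.

<syntaxhighlight lang="python">
# Illustrative sketch only; scikit-learn and the synthetic blob data are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

n = 200
X, _ = make_blobs(n_samples=n, centers=4, random_state=0)

# Increasing k without penalty always lowers the within-cluster error;
# at k = n the error would reach zero (every point becomes its own cluster).
for k in (1, 2, 4, 8, 16):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k = {k:2d}   within-cluster sum of squares = {inertia:.1f}")

# Rule-of-thumb estimate: k is roughly sqrt(n / 2).
k_rule = int(round(np.sqrt(n / 2)))
print("rule-of-thumb k =", k_rule)  # sqrt(200 / 2) = 10
</syntaxhighlight>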